Section 1: Simple Regression
Libraries
library(Statamarkdown) # This is only necessary for me
## Stata found at C:/Program Files/Stata16/StataMP-64.exe
## Warning in readLines("profile.do"): incomplete final line found on 'profile.do'
## Found an existing 'profile.do'
## cd C:\Users\slupp\OneDrive\Skrivebord\NTNU\Mehmet\PSY8003\Simple_and_multiple_regression
##
## use flats.dta
## The 'stata' engine is ready to use.
library(ggplot2) # Necessary for plotting data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pprint)
Loading The Dataset
R
- Before we do anything we need to open our dataset
- I have saved an .RDS file for R-users.
- .RDS files store a single R-object
- This can also be accessed directly by loading the astatur package and calling the object “flats”
# Changing working directory
setwd("C:/Users/slupp/OneDrive/Skrivebord/NTNU/Mehmet/PSY8003/Simple_and_multiple_regression")
# Reading the dataframe and storing it in an object named df
df <- readRDS("flats.rds")
# Viewing the dataframe
View(df)
STATA
- I have stored the same dataset as a .dta file
// setting working directory
cd C:\Users\slupp\OneDrive\Skrivebord\NTNU\Mehmet\PSY8003\Simple_and_multiple_regression
// Loading the dataset
use flats.dta
// Viewing the dataset
browse
C:\Users\slupp\OneDrive\Skrivebord\NTNU\Mehmet\PSY8003\Simple_and_multiple_regr
> ession
request ignored because of batch mode
Getting an Overview of the Dataset
R
# Base R:
summary(df)
flat_price location floor_size year_built
Min. : 225000 centre:34 Min. : 20.00 Min. :1871
1st Qu.: 398333 south : 9 1st Qu.: 58.00 1st Qu.:1998
Median : 498333 west :11 Median : 74.00 Median :2006
Mean : 529383 east :41 Mean : 77.37 Mean :2001
3rd Qu.: 588750 3rd Qu.: 88.00 3rd Qu.:2010
Max. :1816667 Max. :228.00 Max. :2013
NA's :16
energy_efficiency
best :10
mediocre:48
poor :24
NA's :13
str(df)
tibble [95 × 5] (S3: tbl_df/tbl/data.frame)
$ flat_price : num [1:95] 664167 225000 453333 446667 331667 ...
$ location : Factor w/ 4 levels "centre","south",..: 1 1 1 1 1 1 1 1 1 1 ...
$ floor_size : num [1:95] 88 20 84 58 46 37 57 129 152 51 ...
..- attr(*, "label")= chr "størrelsen på leiligheten"
..- attr(*, "format.stata")= chr "%8.0g"
$ year_built : num [1:95] 1871 2005 1922 NA 2010 ...
..- attr(*, "label")= chr "året leiligheten er byggd"
..- attr(*, "format.stata")= chr "%8.0g"
$ energy_efficiency: Factor w/ 3 levels "best","mediocre",..: NA 2 3 3 3 2 2 2 3 2 ...
# If you have installed the pprint package
psummary(df)
# Color data frame (class colorDF) 8 x 5:
│Col │Class│NAs │unique│Mean │SD │Min │Max
1│flat_price │<dbl>│ 0│ 55│529382.5│222449.6│225000│1816667
2│location │<fct>│ 0│ 4│ NA│ NA│ NA│ NA
3│floor_size │<dbl>│ 0│ 56│ 77.4│ 32.6│ 20│ 228
4│year_built │<dbl>│ 16│ 26│ 2000.9│ 19.4│ 1871│ 2013
5│energy_efficiency│<fct>│ 13│ 3│ NA│ NA│ NA│ NA
STATA
codebook
sum
flat_price (unlabeled)
-------------------------------------------------------------------------------
type: numeric (double)
range: [225000,1816666.6] units: .01
unique values: 55 missing .: 0/95
mean: 529382
std. dev: 222450
percentiles: 10% 25% 50% 75% 90%
325000 398333 498333 595833 798333
-------------------------------------------------------------------------------
location (unlabeled)
-------------------------------------------------------------------------------
type: numeric (long)
label: location
range: [1,4] units: 1
unique values: 4 missing .: 0/95
tabulation: Freq. Numeric Label
34 1 centre
9 2 south
11 3 west
41 4 east
-------------------------------------------------------------------------------
floor_size størrelsen på leiligheten
-------------------------------------------------------------------------------
type: numeric (double)
range: [20,228] units: 1
unique values: 56 missing .: 0/95
mean: 77.3684
std. dev: 32.6438
percentiles: 10% 25% 50% 75% 90%
46 58 74 88 113
-------------------------------------------------------------------------------
year_built året leiligheten er byggd
-------------------------------------------------------------------------------
type: numeric (double)
range: [1871,2013] units: 1
unique values: 26 missing .: 16/95
mean: 2000.89
std. dev: 19.3934
percentiles: 10% 25% 50% 75% 90%
1993 1997 2006 2010 2012
-------------------------------------------------------------------------------
energy_efficiency (unlabeled)
-------------------------------------------------------------------------------
type: numeric (long)
label: energy_efficiency
range: [1,3] units: 1
unique values: 3 missing .: 13/95
tabulation: Freq. Numeric Label
10 1 best
48 2 mediocre
24 3 poor
13 .
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
flat_price | 95 529382.5 222449.6 225000 1816667
location | 95 2.621053 1.354103 1 4
floor_size | 95 77.36842 32.64381 20 228
year_built | 79 2000.886 19.39336 1871 2013
energy_eff~y | 82 2.170732 .624695 1 3
Running a simple regression
R
- Regressions are done using the lm() function
- The syntax is:
# lm(formula, data, subset, weights, na.action,
# method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
# singular.ok = TRUE, contrasts = NULL, offset, ...)
The most important arguments are formula and data
The formula argument is derived from the mathemeatical specification of your model where “=” is substituded with “~”
\(\hat{Y}_i \sim \hat{\beta_0} + \hat{\beta_1}X_i\)
Since we haven’t estimated \(\hat{\beta}_0 + \hat{\beta}_1\) , we leave them out of the equation
We’re then left with: \(\hat{Y}_i \sim X_i\)
The similarity between the mathematical formula, and the formula used in the lm() function will become more apparent when we look at more complex models (e.g., multiple- and interaction regression)
Lets say we want to see how well floor size predicts flat price
# Using named arguments and storing the model as an object
reg_price_size <- lm(formula = flat_price ~ floor_size,
data = df)
# Using positional arguments
reg_price_size <- lm(flat_price ~ floor_size, df)
# A little tip:
# When calling the functions this way R cannot autocomplete for you
# But if we use the pipe operator from the margarittr/dplyr package
# We can get our variable_names autocompleted
reg_price_size <- df %>% lm(data =, #%>% pipes to the first argument
formula = flat_price ~ floor_size)
reg_price_size
Call:
lm(formula = flat_price ~ floor_size, data = .)
Coefficients:
(Intercept) floor_size
121356 5274
- Note that we do not get a lot of information by just calling the model it self
- To fix this we can either us the base function summary() or the preg() function from pprint
summary(reg_price_size)
Call:
lm(formula = flat_price ~ floor_size, data = .)
Residuals:
Min 1Q Median 3Q Max
-234582 -70624 -8154 42566 1014989
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121356.1 37547.9 3.232 0.0017 **
floor_size 5273.8 447.5 11.785 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141600 on 93 degrees of freedom
Multiple R-squared: 0.5989, Adjusted R-squared: 0.5946
F-statistic: 138.9 on 1 and 93 DF, p-value: < 2.2e-16
preg(reg_price_size)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│2.79e+12│ 1│F(1, 93)│138.89│ R-squared│5.99e-01
Residual│1.87e+12│ 93│prob > F│ 4e-20│adj R-squared│5.95e-01
Total│4.65e+12│ 94│ N│ 95│ MSE│1.42e+05
Coefficients: flat_price
# Color data frame (class colorDF) 6 x 2:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 121356│ 37548│ 3.23│ 0.002│ 46793│ 195919
floor_size│ 5274│ 447│ 11.79│ <2e-16│ 4385│ 6162
STATA
- In STATA it is a bit simpler
- There we can use the regress command (abbreviation: reg)
- Syntax: regress depvar [indepvars] [if] [in] [weight] [, options]
- Since each STATA-session only uses a single dataset we do not need to specify the dataset
- Further the reg command automatically summarizes the model
reg flat_price floor_size
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(1, 93) = 138.89
Model | 2.7860e+12 1 2.7860e+12 Prob > F = 0.0000
Residual | 1.8655e+12 93 2.0059e+10 R-squared = 0.5989
-------------+---------------------------------- Adj R-squared = 0.5946
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 1.4e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
floor_size | 5273.811 447.4979 11.79 0.000 4385.169 6162.453
_cons | 121356.1 37547.91 3.23 0.002 46793.35 195918.8
------------------------------------------------------------------------------
Interpretation
Does anyone want to interpret the results?
cat(interpretation_reg_price_size)
Error in eval(expr, envir, enclos): object 'interpretation_reg_price_size' not found
Plotting a Simple Regression
R
- We can use base-functions to make both 2D and 3D plots, but for plotting 2D graphs i prefer using ggplot(2)
- We use the ggplot function
- When using ggplot we must supply our dataframe
- ggplot works by layering graphics onto each other.
- Each ggplot builds upon a mapping, which most often is called by aes(x = x-axis, y = y-axis)
- We add layers on top using the + operator
- Here we use geom_smooth() to add a function summarizing our data
library(ggplot2)
ggplot(data = df,
mapping = aes(x = floor_size,
y = flat_price)) +
geom_smooth(method = "lm", se = F) #se = F removes error estimates
`geom_smooth()` using formula = 'y ~ x'
We can improve the plot by adding participant values using geom_point()
ggplot(data = df,
mapping = aes(x = floor_size,
y = flat_price)) +
geom_smooth(method = "lm", se = F) +
geom_point()
`geom_smooth()` using formula = 'y ~ x'
We can also add the baseline model (i.e., the mean) using hline() (horizontal line)
ggplot(data = df,
mapping = aes(x = floor_size,
y = flat_price)) +
geom_smooth(method = "lm", se = F) +
geom_point() +
geom_hline(yintercept = mean(df$flat_price),
color = "red",
linewidth = 1)
`geom_smooth()` using formula = 'y ~ x'
We can customize labels using the labs() argument
ggplot(data = df,
mapping = aes(x = floor_size,
y = flat_price)) +
geom_smooth(method = "lm", se = F) +
geom_point() +
geom_hline(yintercept = mean(df$flat_price),
color = "red",
linewidth = 1) +
labs(x = "Flat Price $", y = "Floor Size sqm")
`geom_smooth()` using formula = 'y ~ x'
STATA
- We can achieve the same in STATA using the twoway command.
- The twoway command builds upon the plot command, but allows layering of different plots
- It is a bit more restricted than ggplot, so i haven’t found a very elegant (while being easy) solution
Basic Regression Plot
twoway lfit flat_price floor_size
Adding data points
twoway (lfit flat_price floor_size) (scatter flat_price floor_size)
Adding a mean line
// Creating function which is mean of flatprice
egen mean_var = mean(flat_price)
// Plot
twoway (lfit flat_price floor_size) (scatter flat_price floor_size) (lfit mean_var floor_size)
Exercises
- Run a regression between flat price and year built
- Interpret the model as a whole, and the coefficients
- Create a graph representing your model
Section 2: Multiple Regression
R
- We use the exact same function for a multiple regression, as a simple regression
- As in the mathematical formula, we simply add a variable using the + operator
- $Y_i \sim \beta_1 X_{1i} + \beta_2X_{2i} ... \beta_kX_{ki} $
reg_price_size_year <- df %>% lm(data =,
formula = flat_price ~ floor_size + year_built)
summary(reg_price_size_year)
Call:
lm(formula = flat_price ~ floor_size + year_built, data = .)
Residuals:
Min 1Q Median 3Q Max
-209201 -68960 -17571 55959 1001158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1153762.9 1762246.8 -0.655 0.515
floor_size 5367.4 506.2 10.604 <2e-16 ***
year_built 640.0 878.9 0.728 0.469
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 150100 on 76 degrees of freedom
(16 observations deleted due to missingness)
Multiple R-squared: 0.5967, Adjusted R-squared: 0.5861
F-statistic: 56.22 on 2 and 76 DF, p-value: 1.032e-15
preg(reg_price_size_year, std_beta = T)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│2.53e+12│ 2│F(2, 76)│56.22│ R-squared│5.97e-01
Residual│1.71e+12│ 76│prob > F│1e-15│adj R-squared│5.86e-01
Total│4.25e+12│ 78│ N│ 79│ MSE│1.50e+05
Coefficients: flat_price
# Color data frame (class colorDF) 7 x 3:
│ Coefficients│std.beta│ std.error│ t-value│ p-value│ CI[2.5%]
(Intercept)│ -1153763│ 0.000│ 1762247│ -0.655│ 0.515│ -4663582
floor_size│ 5367│ 0.775│ 506│ 10.604│ <2e-16│ 4359
year_built│ 640│ 0.053│ 879│ 0.728│ 0.469│ -1111
│ CI[97.5%]
(Intercept)│ 2356056
floor_size│ 6376
year_built│ 2391
STATA
reg flat_price floor_size year_built
Source | SS df MS Number of obs = 79
-------------+---------------------------------- F(2, 76) = 56.22
Model | 2.5332e+12 2 1.2666e+12 Prob > F = 0.0000
Residual | 1.7121e+12 76 2.2528e+10 R-squared = 0.5967
-------------+---------------------------------- Adj R-squared = 0.5861
Total | 4.2453e+12 78 5.4427e+10 Root MSE = 1.5e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
floor_size | 5367.429 506.182 10.60 0.000 4359.28 6375.577
year_built | 640.0368 878.9464 0.73 0.469 -1110.537 2390.61
_cons | -1153763 1762247 -0.65 0.515 -4663582 2356056
------------------------------------------------------------------------------
Anybody want to intepret the results?
cat(interpretation_reg_price_size_year)
Error in eval(expr, envir, enclos): object 'interpretation_reg_price_size_year' not found
Plotting Multiple Regressions
- When we have a multiple regression we can no longer represent it as is, in two dimensions.
- If it has only two predictors, we can represent it as a three dimensional plot
# install.packages("plot3D")
library(plot3D)
grid_size <- 30
# NA's cause problems since it affects the surface and scatter object differently
df2 <- na.omit(df)
reg_price_size_year <- df2 %>% lm(data =,
formula = flat_price ~ floor_size + year_built)
# Creating a grid representing our regression plane
seq_floor_size <- seq(min(df2$floor_size, na.rm = T),
max(df2$floor_size, na.rm = T),
length.out = grid_size)
seq_year_built <- seq(min(df2$year_built, na.rm = T),
max(df2$year_built, na.rm = T),
length.out = grid_size)
year_size <- expand.grid(floor_size = seq_floor_size,
year_built = seq_year_built)
predicted_price <- predict(reg_price_size_year,
newdata = year_size) |>
matrix(ncol = grid_size, nrow = grid_size)
fitpoints <- predict(reg_price_size_year)
scatter3D(x = df2$floor_size,
y = df2$year_built,
z = df2$flat_price,
# Regression plane
surf = list(x = seq_floor_size,
y = seq_year_built,
z = predicted_price,
# Residuals
fit = fitpoints,
# Surface lines
facets = NA,
col = "black"),
# Plot Settings
pch = 18,
ticktype = "detailed",
bty = "g",
col = "black",
# Viewpoint
theta = 20,
phi = 15)
# install.packages("plot3D")
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
grid_size <- 30
# Creating a grid representing our regression plane
seq_floor_size <- seq(min(df2$floor_size, na.rm = T),
max(df2$floor_size, na.rm = T),
length.out = grid_size)
seq_year_built <- seq(min(df2$year_built, na.rm = T),
max(df2$year_built, na.rm = T),
length.out = grid_size)
year_size <- expand.grid(floor_size = seq_floor_size,
year_built = seq_year_built)
predicted_price <- matrix(predict(reg_price_size_year,
newdata = year_size),
ncol = grid_size,
nrow = grid_size,
byrow = TRUE)
xyz <- cbind(seq_floor_size,
seq_year_built,
predict(reg_price_size_year, newdata = year_size))
fitpoints <- predict(reg_price_size_year)
plot_ly(data = df2,
type = "scatter3d",
#mode = "markers",
x = ~floor_size,
y = ~year_built,
z = ~flat_price,
color = ~location,
size = I(150)) %>%
add_surface(x = seq_floor_size,
y = seq_year_built,
z = predicted_price,
opacity = .3)
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
A simpler version:
#install.packages("margins")
library(margins)
reg_price_size_year <- lm(flat_price ~ floor_size + year_built, data = df) #this must be called this way, for the persp function to work
persp(reg_price_size_year,
xvar = "floor_size",
yvar = "year_built",
theta = -30,
phi = 15)
It is however easier using 2D graphs
library(margins)
# Making a dataframe with predictions at different combinations of values
margins_regression <- margins(reg_price_size_year,
at = list(floor_size = c(50,
100,
150,
200),
year_built = c(1970,
1990,
2000,
2010)))
# Converting year_built predictions to a factor, so it can be used as a categorical variable in ggplot
margins_regression <- mutate(margins_regression,
year_built = factor(year_built,
levels = c(1970,
1990,
2000,
2010),
labels = c("< 1970",
"< 1990",
"< 2000",
"< 2013")))
# Creating a categorical variable for year_built for observed values, so it can be used as a categorical variable in scatterplot
df_scatter <- df %>% mutate(year_built = case_when(year_built <= 1970 ~ "< 1970",
year_built <= 1990 ~ "< 1990",
year_built <= 2000 ~ "< 2000",
year_built <= 2013 ~ "< 2013"))
ggplot(margins_regression,
mapping = aes(x = floor_size,
y = fitted,
colour = year_built)) +
geom_smooth(method = "lm", se = F, fullrange = TRUE) +
geom_point(df_scatter,
mapping = aes(x = floor_size,
y = flat_price))
`geom_smooth()` using formula = 'y ~ x'
# Making a dataframe with predictions at different combinations of values
margins_regression <- margins(reg_price_size_year,
at = list(floor_size = c(50,
100,
150,
200),
year_built = c(1940,
1990,
2000,
2010)))
# Converting floor_size predictions to a factor, so it can be used as a categorical variable in ggplot
margins_regression <- mutate(margins_regression,
floor_size = factor(floor_size,
levels = c(50,
100,
150,
200),
labels = c("< 50",
"< 100",
"< 150",
"> 150")))
# Creating a categorical variable for floor_size for observed values, so it can be used as a categorical variable in scatterplot
df_scatter <- df %>% mutate(floor_size = case_when(floor_size <= 50 ~ "< 50",
floor_size <= 100 ~"< 100",
floor_size <= 150 ~ "< 150",
floor_size > 150 ~ "> 150"))
ggplot(margins_regression,
mapping = aes(x = year_built,
y = fitted,
colour = floor_size)) +
geom_smooth(method = "lm", se = F, fullrange = TRUE) +
# adding scatterplot (using) observed values
geom_point(df_scatter,
mapping = aes(x = year_built,
y = flat_price))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 16 rows containing missing values (`geom_point()`).
STATA
reg flat_price floor_size year_built
// X-axis = year_built
margins, at(year_built = (1930(50)2010) floor_size = (60(50)220))
marginsplot
// X-axis = floor_size
margins, at(floor_size = (60(50)220) year_built = (1930(30)2010))
marginsplot
Source | SS df MS Number of obs = 79
-------------+---------------------------------- F(2, 76) = 56.22
Model | 2.5332e+12 2 1.2666e+12 Prob > F = 0.0000
Residual | 1.7121e+12 76 2.2528e+10 R-squared = 0.5967
-------------+---------------------------------- Adj R-squared = 0.5861
Total | 4.2453e+12 78 5.4427e+10 Root MSE = 1.5e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
floor_size | 5367.429 506.182 10.60 0.000 4359.28 6375.577
year_built | 640.0368 878.9464 0.73 0.469 -1110.537 2390.61
_cons | -1153763 1762247 -0.65 0.515 -4663582 2356056
------------------------------------------------------------------------------
Adjusted predictions Number of obs = 79
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : floor_size = 60
year_built = 1930
2._at : floor_size = 60
year_built = 1980
3._at : floor_size = 110
year_built = 1930
4._at : floor_size = 110
year_built = 1980
5._at : floor_size = 160
year_built = 1930
6._at : floor_size = 160
year_built = 1980
7._at : floor_size = 210
year_built = 1930
8._at : floor_size = 210
year_built = 1980
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 403553.8 65859.33 6.13 0.000 272383.5 534724
2 | 435555.6 27030.97 16.11 0.000 381718.8 489392.4
3 | 671925.2 65373.07 10.28 0.000 541723.4 802126.9
4 | 703927 28966.64 24.30 0.000 646235 761619.1
5 | 940296.6 74100.79 12.69 0.000 792712.1 1087881
6 | 972298.5 47207.63 20.60 0.000 878276.3 1066321
7 | 1208668 89382.98 13.52 0.000 1030646 1386690
8 | 1240670 69993.97 17.73 0.000 1101265 1380075
------------------------------------------------------------------------------
Variables that uniquely identify margins: year_built floor_size
Adjusted predictions Number of obs = 79
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : floor_size = 60
year_built = 1930
2._at : floor_size = 60
year_built = 1960
3._at : floor_size = 60
year_built = 1990
4._at : floor_size = 110
year_built = 1930
5._at : floor_size = 110
year_built = 1960
6._at : floor_size = 110
year_built = 1990
7._at : floor_size = 160
year_built = 1930
8._at : floor_size = 160
year_built = 1960
9._at : floor_size = 160
year_built = 1990
10._at : floor_size = 210
year_built = 1930
11._at : floor_size = 210
year_built = 1960
12._at : floor_size = 210
year_built = 1990
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 403553.8 65859.33 6.13 0.000 272383.5 534724
2 | 422754.9 41350.41 10.22 0.000 340398.4 505111.3
3 | 441956 21745.02 20.32 0.000 398647 485264.9
4 | 671925.2 65373.07 10.28 0.000 541723.4 802126.9
5 | 691126.3 41825.44 16.52 0.000 607823.7 774428.9
6 | 710327.4 24812.9 28.63 0.000 660908.2 759746.6
7 | 940296.6 74100.79 12.69 0.000 792712.1 1087881
8 | 959497.7 55407.39 17.32 0.000 849144.3 1069851
9 | 978698.8 45162.1 21.67 0.000 888750.7 1068647
10 | 1208668 89382.98 13.52 0.000 1030646 1386690
11 | 1227869 75310.63 16.30 0.000 1077875 1377863
12 | 1247070 68881.43 18.10 0.000 1109881 1384260
------------------------------------------------------------------------------
Variables that uniquely identify margins: floor_size year_built
Exercises
- Perform a multiple regression with flat_price as a
dependent variable where independent variables are floor_size
and energy_efficiency
- NB: Note that energy_efficiency is coded as a nominal variable. In this exercise i want you to treat it as a continuous variable, so it must be recoded as such.
- I encourage you (Regardless of which program you use) to recode the variable on your own. If you struggle to do so, i have included a solution under [Appendix: Recoding energy_efficiency]
- Interpret the results
- Visualize your model in a plot (you decide what plot you want to use)
- Perform a multiple regression with flat_price as a dependent variable where independent variables are floor_size year_built and energy_efficiency
- Interpret the results
- Try to visualize this model in a twodimensional plot using the margins commands in STATA, or the margins package in R (NB this is a bit harder) I have included a (suggested) solution Appendix: Solution Exercise 6
Appendix
Appendix: Using ggeffects to visualize regressions.
Depending on the circumstances it can be a bit easier to use the ggeffects package, compared to the margins package, it is especially good for visualizing interaction effects.
#install.packages("ggeffects")
library(ggeffects)
prediction <- ggpredict(reg_price_size_year,
terms = c("year_built", "floor_size"))
df <- df %>% mutate(group = case_when(floor_size <= 46.8 ~ "46.8",
floor_size <= 77.4 ~ "77.4",
floor_size <= 108 ~ "108"))
prediction %>% ggplot(mapping = aes(x = x,
y = predicted,
colour = group)) +
geom_smooth(method = "lm",
formula = y ~ x,
se = F) +
# This adds a scatter plot, but is not necessary for visualizing the model
geom_point(data = df,
mapping = aes(x = year_built,
y = flat_price,
colour = group), inherit.aes = FALSE) +
ylab("Flat Price USD") +
xlab("Year Built")
Warning: Removed 16 rows containing missing values (`geom_point()`).
Appendix: Recoding energy_efficiency
R
df <- df %>%
mutate(i.energy_efficiency = case_when(energy_efficiency == "poor" ~ 1,
energy_efficiency == "mediocre" ~ 2,
energy_efficiency == "best" ~ 3))
# NB we could also use as.integer(energy_efficency), but this would cause the variable to be coded in reverse order (i.e., best = 1, and poor = 3)
STATA
// When recoding the variable this way, labels are removed and it becomes numeric. Note also that the variable is coded in reverse order
recode energy_efficiency (1 = 3) (2 = 2) (3 = 1)
reg(flat_price floor_size energy_efficiency)
(energy_efficiency: 34 changes made)
Source | SS df MS Number of obs = 82
-------------+---------------------------------- F(2, 79) = 64.10
Model | 2.5440e+12 2 1.2720e+12 Prob > F = 0.0000
Residual | 1.5677e+12 79 1.9844e+10 R-squared = 0.6187
-------------+---------------------------------- Adj R-squared = 0.6091
Total | 4.1116e+12 81 5.0761e+10 Root MSE = 1.4e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
floor_size | 5747.269 523.2311 10.98 0.000 4705.803 6788.734
energy_eff~y | 80118.73 25076.54 3.19 0.002 30205.12 130032.3
_cons | -64946.35 64265.62 -1.01 0.315 -192863.9 62971.16
------------------------------------------------------------------------------
// As with R there is a simpler soloution where we simple use c. (for continous) at the start of our variable when calling our regression. However the variable would come in reversed order.
reg(flat_price floor_size c.energy_efficiency)
> us) at the start of our variable when calling our regression. However the var
> iable would come in reversed order.
Source | SS df MS Number of obs = 82
-------------+---------------------------------- F(2, 79) = 64.10
Model | 2.5440e+12 2 1.2720e+12 Prob > F = 0.0000
Residual | 1.5677e+12 79 1.9844e+10 R-squared = 0.6187
-------------+---------------------------------- Adj R-squared = 0.6091
Total | 4.1116e+12 81 5.0761e+10 Root MSE = 1.4e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
floor_size | 5747.269 523.2311 10.98 0.000 4705.803 6788.734
energy_eff~y | -80118.73 25076.54 -3.19 0.002 -130032.3 -30205.12
_cons | 255528.6 68240.54 3.74 0.000 119699.2 391358
------------------------------------------------------------------------------
Appendix: Solution Exercise 6
R
reg_price_size_energy_year <- lm(flat_price ~ floor_size + energy_efficiency + year_built, data = df)
margins_regression <- margins(reg_price_size_energy_year,
at = list(floor_size = c(50,
100,
150,
200),
year_built = c(1970,
1990,
2000,
2010),
i.energy_efficiency = c(1:3)))
margins_regression <- mutate(margins_regression,
year_built = factor(year_built,
levels = c(1970,
1990,
2000,
2010),
labels = c("< 1970",
"< 1990",
"< 2000",
"< 2013")))
margins_regression <- mutate(margins_regression, # We could not just is it as a factor from the start, since it would change the model
i.energy_efficiency = factor(i.energy_efficiency,
levels = c(1:3),
labels= c("Poor",
"Mediocre",
"Best")))
margins_regression %>% ggplot(mapping = aes(x = floor_size,
y = fitted,
colour = year_built,
linetype = energy_efficiency) ) +
geom_point() +
geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'
STATA
// Regression flat_price ~ floor_size + energy_efficiency + year_built
reg flat_price floor_size c.energy_efficiency year_built
// Visualizing Regression using margins and marginsplot
margins, at(floor_size = (60(50)220) year_built = (1930(30)2010) energy_efficiency = (1(1)3))
marginsplot
Source | SS df MS Number of obs = 67
-------------+---------------------------------- F(3, 63) = 32.80
Model | 2.2832e+12 3 7.6106e+11 Prob > F = 0.0000
Residual | 1.4620e+12 63 2.3206e+10 R-squared = 0.6096
-------------+---------------------------------- Adj R-squared = 0.5910
Total | 3.7452e+12 66 5.6745e+10 Root MSE = 1.5e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
floor_size | 6018.797 615.7774 9.77 0.000 4788.265 7249.33
energy_eff~y | -74322.56 34854.55 -2.13 0.037 -143973.8 -4671.3
year_built | 1340.768 1483.367 0.90 0.370 -1623.504 4305.041
_cons | -2458863 2993242 -0.82 0.414 -8440380 3522655
------------------------------------------------------------------------------
Adjusted predictions Number of obs = 67
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : floor_size = 60
energy_eff~y = 1
year_built = 1930
2._at : floor_size = 60
energy_eff~y = 1
year_built = 1960
3._at : floor_size = 60
energy_eff~y = 1
year_built = 1990
4._at : floor_size = 60
energy_eff~y = 2
year_built = 1930
5._at : floor_size = 60
energy_eff~y = 2
year_built = 1960
6._at : floor_size = 60
energy_eff~y = 2
year_built = 1990
7._at : floor_size = 60
energy_eff~y = 3
year_built = 1930
8._at : floor_size = 60
energy_eff~y = 3
year_built = 1960
9._at : floor_size = 60
energy_eff~y = 3
year_built = 1990
10._at : floor_size = 110
energy_eff~y = 1
year_built = 1930
11._at : floor_size = 110
energy_eff~y = 1
year_built = 1960
12._at : floor_size = 110
energy_eff~y = 1
year_built = 1990
13._at : floor_size = 110
energy_eff~y = 2
year_built = 1930
14._at : floor_size = 110
energy_eff~y = 2
year_built = 1960
15._at : floor_size = 110
energy_eff~y = 2
year_built = 1990
16._at : floor_size = 110
energy_eff~y = 3
year_built = 1930
17._at : floor_size = 110
energy_eff~y = 3
year_built = 1960
18._at : floor_size = 110
energy_eff~y = 3
year_built = 1990
19._at : floor_size = 160
energy_eff~y = 1
year_built = 1930
20._at : floor_size = 160
energy_eff~y = 1
year_built = 1960
21._at : floor_size = 160
energy_eff~y = 1
year_built = 1990
22._at : floor_size = 160
energy_eff~y = 2
year_built = 1930
23._at : floor_size = 160
energy_eff~y = 2
year_built = 1960
24._at : floor_size = 160
energy_eff~y = 2
year_built = 1990
25._at : floor_size = 160
energy_eff~y = 3
year_built = 1930
26._at : floor_size = 160
energy_eff~y = 3
year_built = 1960
27._at : floor_size = 160
energy_eff~y = 3
year_built = 1990
28._at : floor_size = 210
energy_eff~y = 1
year_built = 1930
29._at : floor_size = 210
energy_eff~y = 1
year_built = 1960
30._at : floor_size = 210
energy_eff~y = 1
year_built = 1990
31._at : floor_size = 210
energy_eff~y = 2
year_built = 1930
32._at : floor_size = 210
energy_eff~y = 2
year_built = 1960
33._at : floor_size = 210
energy_eff~y = 2
year_built = 1990
34._at : floor_size = 210
energy_eff~y = 3
year_built = 1930
35._at : floor_size = 210
energy_eff~y = 3
year_built = 1960
36._at : floor_size = 210
energy_eff~y = 3
year_built = 1990
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 415625.4 123336.4 3.37 0.001 169157.2 662093.6
2 | 455848.5 82387.39 5.53 0.000 291210.4 620486.5
3 | 496071.5 48209.67 10.29 0.000 399732.2 592410.9
4 | 341302.9 109000.6 3.13 0.003 123482.5 559123.2
5 | 381525.9 65953.67 5.78 0.000 249728 513323.8
6 | 421749 27916.3 15.11 0.000 365962.7 477535.2
7 | 266980.3 104785.8 2.55 0.013 57582.49 476378.1
8 | 307203.3 65892.11 4.66 0.000 175528.5 438878.2
9 | 347426.4 40793.96 8.52 0.000 265906.2 428946.6
10 | 716565.3 123420.5 5.81 0.000 469929.1 963201.5
11 | 756788.3 83544.81 9.06 0.000 589837.4 923739.3
12 | 797011.4 51841.36 15.37 0.000 693414.7 900608.1
13 | 642242.7 108457.6 5.92 0.000 425507.6 858977.9
14 | 682465.8 66355.85 10.28 0.000 549864.2 815067.4
15 | 722688.8 31683.34 22.81 0.000 659374.7 786002.9
16 | 567920.2 103552.6 5.48 0.000 360986.8 774853.5
17 | 608143.2 65239.12 9.32 0.000 477773.2 738513.2
18 | 648366.3 41830.87 15.50 0.000 564773.9 731958.6
19 | 1017505 130955.3 7.77 0.000 755811.9 1279198
20 | 1057728 95224.45 11.11 0.000 867437.3 1248019
21 | 1097951 70333.41 15.61 0.000 957401.2 1238501
22 | 943182.6 116365.2 8.11 0.000 710645.3 1175720
23 | 983405.7 79700.82 12.34 0.000 824136.3 1142675
24 | 1023629 55895.06 18.31 0.000 911931.3 1135326
25 | 868860 111185.1 7.81 0.000 646674.4 1091046
26 | 909083.1 77887.26 11.67 0.000 753437.8 1064728
27 | 949306.1 61085.23 15.54 0.000 827237 1071375
28 | 1318445 144782.1 9.11 0.000 1029121 1607769
29 | 1358668 114243.5 11.89 0.000 1130371 1586965
30 | 1398891 95404.2 14.66 0.000 1208241 1589541
31 | 1244122 131204.4 9.48 0.000 981931.4 1506314
32 | 1284346 100981.4 12.72 0.000 1082550 1486141
33 | 1324569 84502.01 15.67 0.000 1155705 1493432
34 | 1169800 126083.4 9.28 0.000 917842.4 1421757
35 | 1210023 98856.54 12.24 0.000 1012474 1407572
36 | 1250246 87229 14.33 0.000 1075933 1424559
------------------------------------------------------------------------------
Variables that uniquely identify margins: floor_size year_built
energy_efficiency
Answers to interpretations (only used to call via cat-function)
interpretation_reg_price_size <- "1. Firstly we can see that the overall model is significant p < .001 \n
2. This means that there is less than 0.1% likelyhood to get the same results (or stronger) given no effect \n
3. Secondly we can see that the coefficient for floor_size is significant both in terms of p-value and confidence interval \n
4. The coefficient is 5274$, meaning that flat price on average increases with 5274$ per square meter increase in floor size. \n
5. The coefficient is significant p < .001, and CI does not overlap with 0 \n
6. The intercept/constant is 121356 meaning that we predict that a flat with 0 sqm floor size would (on average) cost 121 356 $. \n
7. Lastly we can see that the model (or rather floor_size) explains 60% of the variance in flat_prize"
interpretation_reg_price_size_year <- "1. Firstly we can see that the overall model is significant p < .001 \n
2. This means that there is less than 0.1% likelyhood to get the same results (or stronger) given no effect \n
3. Secondly we can see that the coefficient for floor_size is significant both in terms of p-value and confidence interval \n
4. We can also see that the model explains roughly 60% of the variance, and about 59% if penalized for adding an extra variable. \n
5. Notice that this is basicly the same as our previous model, suggesting that year_built (in our dataset) is not a strong predictor of flat price \n
6. The coefficient for floor_size is 5367 (p < .001), meaning that flat price on average increases with 5367$ per square meter increase in floor size, controlled for the effect of year_built \n
7. The coefficient for year_built is 640 (p = .469), then, it seeems that year built is not associated with flat price. If we were to interpet the coefficent however, we could say that flat price (on average) increases by 640$ per one year increase in buildyear. We could then say that the price decreases (on average) by 640$ per year old it is."